The SOLAR System for Sharp Web Archiving

Authors

  • Arturas Mazeika
  • Dimitar Denev
  • Marc Spaniol
  • Gerhard Weikum
Abstract

Web archives preserve the history of digital information on the Internet. They are a great asset for media and business analysts and also for experts on intellectual property (e.g., patent offices, IP lawyers) and Internet legislators (e.g., consumer services) to prove or disprove certain allegations. To fulfill this purpose, archives should not only periodically crawl a Web site’s pages but should also assure that the captured pages are a precise representation of the Web site as of a single timepoint. This is a hard problem, since the politeness etiquette and completeness requirement of archive crawlers mandate slow, long-duration crawling while the Web site is changing. This paper presents the SOLAR (Scheduling of Downloads for Archiving of Web Sites) system for sharp Web archiving. SOLAR crawls all pages of a Web site and then re-crawls the visited pages, forming visit-revisit intervals. If all visit-revisit intervals overlap and no page changed between its visit and revisit, then all pages are “sharp” and captured as if the entire site were downloaded instantaneously. SOLAR judiciously schedules visits and revisits to maximize the number of sharp pages based on the predictions of page-specific change rates. Experiments with synthetic data show that SOLAR outperforms existing techniques and captures the sites as sharply as possible.
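The sharpness condition stated in the abstract can be illustrated with a minimal Python sketch. This is not the SOLAR implementation: the `PageCapture` record, the function names, the timestamps, and the choice to treat a change at exactly the revisit time as a change within the interval are all assumptions made for illustration. The sketch only checks the two stated conditions (all visit-revisit intervals overlap, and no page changed between its visit and revisit); it does not model SOLAR's scheduling or change-rate prediction.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageCapture:
    """Hypothetical record of one page's visit-revisit interval."""
    url: str
    visit: float      # time of the first download
    revisit: float    # time of the second download
    changes: List[float] = field(default_factory=list)  # times the page changed


def intervals_overlap(captures: List[PageCapture]) -> bool:
    """All intervals share a common timepoint iff the latest visit
    is no later than the earliest revisit."""
    if not captures:
        return True
    return max(c.visit for c in captures) <= min(c.revisit for c in captures)


def unchanged(capture: PageCapture) -> bool:
    """True if no change falls after the visit and up to the revisit
    (boundary handling is an assumption of this sketch)."""
    return not any(capture.visit < t <= capture.revisit for t in capture.changes)


def all_sharp(captures: List[PageCapture]) -> bool:
    """Both conditions from the abstract: overlapping intervals, no changes."""
    return intervals_overlap(captures) and all(unchanged(c) for c in captures)


# Illustrative data only: three pages with nested visit-revisit intervals.
captures = [
    PageCapture("/a", visit=0, revisit=10),
    PageCapture("/b", visit=1, revisit=9, changes=[12]),  # changed after revisit
    PageCapture("/c", visit=2, revisit=8),
]

# Same schedule, but /b changes inside its interval.
changed = [
    PageCapture("/a", visit=0, revisit=10),
    PageCapture("/b", visit=1, revisit=9, changes=[5]),
    PageCapture("/c", visit=2, revisit=8),
]

# Intervals that do not share any common timepoint.
disjoint = [
    PageCapture("/a", visit=0, revisit=3),
    PageCapture("/b", visit=4, revisit=6),
]

print(all_sharp(captures), all_sharp(changed), all_sharp(disjoint))
```

Nesting the intervals, as in the example data, is one simple way to guarantee that they all overlap; whether SOLAR orders its revisits this way is not stated in the abstract.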

Similar resources

Study of the Attitude of Users towards Picture Archiving and Communication System Based on the Technology Acceptance Model in Teaching Hospitals of Qom, Iran

Background and Objectives: Many healthcare providers use health information technology to improve their performance. Picture Archiving and Communication System is a subsystem of the health information system that aims to facilitate the storing, archiving, and managing of digital images as well as their transmission. In this regard, measuring the level of acceptance of technology can be very hel...

Users’ satisfaction with imaging services before and after the implementation of picture archiving and communication system

  Introduction: The picture archiving and communication system is a digital device designed for processing, archiving and communicating medical images with different parts of hospitals, physicians and radiologists. Therefore, the current study aimed to determine the impact of the system on users’ satisfaction with imaging services before and after its implementation. Methods: This cross-secti...

Users' Views on the Benefits of Using the Picture Archiving and Communication System (PACS) in hospitals affiliated with Mashhad University of Medical Sciences in 2016

Background & Aim: Picture Archiving and Communication System (PACS) is used for storing, retrieving and displaying medical images and transmitting electronic reports. One of the important factors in accepting this system is users. The purpose of this study was to evaluate the users' views on the benefits of using picture archiving and communication system (PACS) in hospitals affiliated with Mas...

A Policy-based Institutional Web Archiving System with Adjustable Exposure of Archived Resources

Despite the recognition that Archiving Web content is important and Web archiving systems freely crawl and collect resources on the Web for preservation, they have difficulties in collecting all versions of a single Web page and in preserving a collected resource with policies of use given by its creator. In order to solve the problems, we have proposed an Institutional Web Archiving System, wh...

Web archiving in a Web 2.0 world

The National Library of Australia is the lead institution for digital archiving and preservation in Australia. Its PANDORA Archive has been the repository for archived web resources in Australia for over ten years and is a mature but continually developing system. The archival management system PANDAS, which underpins the Archive, is, as of 2007, in its third major revision. Other web archiving ac...

Publication date: 2010